

First USA-JAPAN Computer Conference, 1972 


THE KANJI SYSTEM : A NEW METHOD FOR INPUTTING 
JAPANESE SENTENCES 


Hideo Hirahara, Kiyoshi Kikuchi, Kenichi Mori, and Kimihito 
Takeda 

(Toshiba Research and Development Center, Tokyo Shibaura
Electric Co., Ltd., Kawasaki, Japan)


1. INTRODUCTION 
1.1 Problem

The main difficulty in computer processing of
Japanese information has been the lack of adequate
input devices for Japanese sentences. Practical
output devices for Japanese sentences are already
available to computer users from local manufacturers.

Although input devices exist, none of them is
"easy to use". "Easy to use" is difficult to define
precisely; however, restricting the meaning to the
use of keyboard devices, we adopt the following
definition:

1) To input a large character set of Kanji plus Kana 
with a fairly small keyboard. 

2) To monitor input characters (including Kanji and 
Kana). 

3) To correct errors easily. 

4) To input Japanese sentences with no need for 
special training. 

Japanese sentences consist of Kanji, Kana and
numeric characters. The Kana set consists of 48
characters and is divided into two cases, just as
the English alphabet is. The upper case is called
Katakana and the lower case is called Hiragana.

On devices that have small character sets, both
cases cannot be provided, and Katakana is usually
used. Japanese readers make no distinction
between Hiragana and Katakana.

When a sentence is written only in Kana, it is very
difficult to read and understand. Conventionally we
write a sentence as one continuous string of characters
with no spaces, so it is difficult to distinguish
the start and end of each word. Even after the string
is separated correctly into words, there remains
the problem that a string of sounds does not have a
unique meaning. For example, there are at least 70 Kanji
characters with the pronunciation "Shou" (in Katakana
"ショウ"). See Fig. 1.

If Kanji characters are used in a sentence,
there is no ambiguity in word meaning.

1.2 Some Existing Input Devices

1) Kanji-Teletype-Writer This is the most
generally used device. It has 192 keys with 13
shifts, by which we can code Kanji directly. To get
one Kanji code, we search for the key on which the
desired Kanji is located. Each key carries 13 characters,
one for each shift. After selecting the key that
contains the desired character, we must then select
one of the 13 shift keys. This device is difficult to
use and has the following disadvantages.

* We must move our arms to select keys. 

For ordinary type-writers, we select keys by 
finger motion, and we have a "home-position" to 
make selection easy. 

* We must memorize the Kanji array. 

We must recall the locations of 2,000 Kanji characters.
The locations carry no meaning, so they are
difficult to remember.

* We cannot visually verify the input Kanji.

The Kanji-Teletype-Writer produces only a
paper tape, so we cannot check the input as it is
typed. To correct errors, we must first find them
on the paper tape, which is difficult and itself
subject to error.

A number of interactive Kanji input systems have
been designed and implemented. Several are described
briefly below.

2) Sinotype This was developed by Caldwell
in 1959. The system uses two principles concerning
Kanji. First, Kanji characters are constructed
from 21 kinds of strokes. Second, each Kanji has
a unique combination of strokes. So if we have a
Kanji stroke combination dictionary, we can form
all Kanji characters with a 21-key type-writer.

It takes an average of 6 strokes to identify one
Kanji character. However, the ordinary Japanese person,
especially in the younger generation, does not know
the correct stroke combination for most Kanji
characters, and we cannot expect operators to know it.


Session 3-2-1 




3) Sinowriter This system was developed by
IBM. Kanji characters are input with a modified
flexo-type device. The system concepts are as follows.

1. Kanji characters consist of a limited number
of fundamental patterns.

2. The patterns of a Kanji character are classified
in two steps:

1st, the character is divided into an upper-half
pattern and a lower-half pattern.

2nd, each half pattern is classified as one of the 36
standard patterns.

3. Each key is associated with one of the standard
patterns.

4. The operator selects first the upper-half pattern
key and then the lower-half pattern key.

This operation selects at most 16 Kanji characters,
which appear on a CRT. The operator then
selects one of them. We think this system is designed
for foreigners who do not understand Japanese or
Chinese characters.

4) Hand-writing Input by Rand-Tablet This is
a hand-written Kanji character input method. The
idea is similar to the Sinotype. Each hand-written
stroke is identified by kind. The stroke sequence
and the number of strokes are then matched against
a dictionary, and the matched result appears on a CRT.

5) Automatic Conversion of Kana strings into
normal Japanese sentences These systems translate Kana
strings into normal Japanese sentences consisting
of Kanji and Kana. A number of experimental
systems are now in use in Japan. They
have been developed mainly by academic institutes
and corporations.

Such a system must analyze the input string
and specify which substrings should be
translated into Kanji.

The input strings must be segmented word by
word, or phrase by phrase. Two problems must
be solved to do this:

(1) Automatic segmentation, and identification of the
parts of the input string to be translated into Kanji.

(2) Multiple Kanji characters have identical Kana
strings.

The Kana strings have linguistic properties that can
be useful in developing algorithms for segmentation.
The correct Kanji character from among multiple
possibilities can sometimes be determined by grammatical
analysis. When this is not possible, all the
Kanji characters that match are displayed and the
correct one is selected manually. This kind of system
requires large reference files. The advanced
systems can translate about 70-80% of the input
correctly, with about a 5% error rate.


6) An Experiment at Kyoto University Sub-patterns
of Kanji and their positional relations are
used to input and to generate Kanji characters
at Kyoto University. About 150 sub-patterns and
10 operators are used to describe Kanji characters.
Kanji characters are described in list expressions.

The main purpose of this experiment is not to
input Japanese sentences with a keyboard, but to
investigate pattern recognition through Kanji characters.

2. THE KANJI SYSTEM

In this chapter, we discuss our method of
inputting Japanese sentences using a Kana keyboard.
The Kana part of a sentence poses no problem on
input. For the Kanji part of the sentence, we
retrieve the Kanji characters that match the pronunciation
given in Kana.

2.1 Bunkai-Hatsuon-Shiki Kanji Input

2.1.1 Some Properties of Kanji Pronunciation
Kanji characters have an average of 3 different
pronunciations. We divide the pronunciations into
two categories: "ON", the Chinese pronunciation, and
"KUN", the original Japanese pronunciation. Many
Kanji characters share the same "ON" pronunciation,
while fewer Kanji characters share the same "KUN"
pronunciation.

As shown in Fig. 1, more than 70 Kanji characters
can be pronounced "Shou". It is very difficult
to choose the correct one from among the 70 on a
CRT. Most Kanji characters, however, have several
pronunciations. If we add to the input a second
pronunciation that is not the one used in the context,
the number of Kanji characters matching
both pronunciations is greatly reduced, and the
ambiguity becomes less serious. This is the fundamental
concept of inputting Kanji by pronunciation. We call
this method "Bunkai-Hatsuon".
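The effect of adding a second pronunciation can be sketched as a set intersection. The following is a minimal illustration only: the table `READINGS` and the function `candidates` are hypothetical names, and the six entries are a toy sample of well-known readings, not the dictionary used by the Kanji system.

```python
# Toy reading table (an assumption for illustration): Kanji -> set of Kana readings.
READINGS = {
    "小": {"ショウ", "チイサイ", "コ", "オ"},
    "少": {"ショウ", "スクナイ", "スコシ"},
    "正": {"セイ", "ショウ", "タダシイ", "マサ"},
    "生": {"セイ", "ショウ", "イキル", "ナマ"},
    "松": {"ショウ", "マツ"},
    "笑": {"ショウ", "ワラウ"},
}


def candidates(*kana_readings):
    """Return the Kanji whose reading set contains every reading given."""
    wanted = set(kana_readings)
    return sorted(k for k, r in READINGS.items() if wanted <= r)


# One reading leaves every character in the toy table as a candidate;
# a second reading narrows the choice sharply.
print(len(candidates("ショウ")))
print(candidates("ショウ", "コ"))
```

With a realistic table the first call would return the 70 or more characters of Fig. 1, which is why the second reading matters.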

2.1.2 An Evaluation As the comparison
criterion we take the ambiguity eliminated by each key
stroke, that is, the amount of information each stroke
contributes to retrieving a Kanji. We compare the
effectiveness of our method with that of another system,
taking the Sinotype as an example.

Tests have shown that when the Kanji system
is used, an average of 4 key strokes will uniquely
describe a Kanji character. When the Sinotype is
used, it takes an average of 6 strokes to uniquely
describe a Kanji character. The average effectiveness
of the Kanji system is therefore about 30-40%
higher than that of the Sinotype.
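As a rough check, and assuming that "effectiveness" is read as the relative saving in strokes per character (this reading is an assumption; the measure is not defined more precisely here), the two averages of 4 and 6 strokes give a value inside the stated range:

```latex
\frac{n_{\mathrm{stroke}} - n_{\mathrm{pron}}}{n_{\mathrm{stroke}}}
  = \frac{6 - 4}{6} \approx 0.33
```

Here $n_{\mathrm{stroke}}$ and $n_{\mathrm{pron}}$ are illustrative symbols for the average number of strokes per character with stroke spelling and with pronunciation input respectively.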

[Figure: number of key operations; legend: O-O Pronunciation, •-• Stroke spelling]

2.1.3 Some Properties of "Kanji-Jukugo"

Kanji characters are semantic characters. Each
character has its own meaning. Multiple Kanji
characters can be combined to generate new words, such
as 科学: 科 means "class" or "category", 学 means "learning"
or "school", and the word 科学 means "science".
We call these generated words Kanji-Jukugo. Kanji-Jukugos
are combined with other Kanji-Jukugos or
Kanji characters to produce new Kanji-Jukugos.
For example, 科学 combines with 技術, which means
"technology", and 庁, which means "bureau", to form
科学技術庁, which means
"National Bureau of Science and Technology".

If we treat a Kanji-Jukugo as a unit on input,
we can get more effective input performance,
because:

1. Kanji characters or Kanji-Jukugos with identical
phonetic input strings become fewer.

2. Less input redundancy is required. Consider the
following example.

When we input (カガク) "Kagaku" for 科学, we
will find that both 科学 and 化学 match. We can select
科学 uniquely by adding (トガ) "Toga". "Toga" is
another pronunciation of 科. If instead we input (カ) "ka"
for 科 alone, we will find 39 Kanji characters, 科 among
them. To get 科 uniquely, we
must add "Toga" or choose from the 39 Kanji characters.

2.1.4 Kanji-Jukugo Retrieval Let Y1, Y2 and
Y3 be Kana input strings, where Y1 is the pronunciation
of the desired Kanji-Jukugo, and Y2 and Y3 are
pronunciations of Kanji characters contained in the
Kanji-Jukugo, other than those
used in the Kanji-Jukugo itself. Y2 or Y3 can be null.

For the input string Y1, Y2, Y3, we search the
Kanji-Jukugo dictionary first with Y1. This first
reference requires the full dictionary. We then check
whether each Kanji character or Kanji-Jukugo in the
result of the first retrieval matches Y2 and Y3.

According to our experience, after Y2 is added
there remain no more than three possible Kanji-Jukugos.
See Fig. 1 and Fig. 2 for an example. Fig. 1
shows all Kanji characters and Kanji-Jukugos which
have the pronunciation (ショウ) "Shou" as Y1. Fig.
2 shows that only two Kanji characters have both
the pronunciation (ショウ) "Shou" as Y1 and (コ) "Ko"
as Y2.
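The two-stage retrieval can be sketched as follows. This is a minimal illustration under stated assumptions: `JUKUGO`, `KANJI_READINGS` and `retrieve` are hypothetical names, the dictionaries hold only the "Kagaku" example of 2.1.3, and a real system would hold character codes rather than the characters themselves.

```python
# Hypothetical miniature dictionaries; the entries are illustrative only.
JUKUGO = {                       # pronunciation (Kana) -> candidate Kanji-Jukugos
    "カガク": ["科学", "化学"],
}
KANJI_READINGS = {               # Kanji character -> all its pronunciations (Kana)
    "科": {"カ", "トガ"},
    "学": {"ガク", "マナブ"},
    "化": {"カ", "ケ", "バケル"},
}


def retrieve(y1, y2=None, y3=None):
    """Search the full dictionary with Y1, then filter the hits by Y2 and Y3.

    Y2 and Y3 may be None (null). A hit survives a filter if at least one
    of its Kanji characters has the extra pronunciation.
    """
    hits = JUKUGO.get(y1, [])
    for extra in (y2, y3):
        if extra:
            hits = [word for word in hits
                    if any(extra in KANJI_READINGS.get(ch, ()) for ch in word)]
    return hits


print(retrieve("カガク"))          # both homophones remain
print(retrieve("カガク", "トガ"))  # the extra reading of 科 selects one
```

Only the first lookup touches the full dictionary; the filters run over the small result list, which matches the order of operations described above.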



Fig. 1 (Kanji characters and Kanji-Jukugos with the pronunciation "Shou")

Fig. 2 (the two Kanji characters with both the pronunciations "Shou" and "Ko")

2.1.5 Kanji-Jukugo Dictionary The Kanji-Jukugo
dictionary is organized as follows:

1) The dictionary consists of a Kanji
character dictionary and a Kanji-Jukugo dictionary.

2) The index words of the Kanji-Jukugo dictionary are
the pronunciations of the Kanji-Jukugos expressed in
Kana. The content of a Kanji-Jukugo item is a string
of Kanji character codes.

3) The Kanji character dictionary is indexed
by Kanji character code. The content of each item is
the pronunciations of the Kanji character expressed in
Kana, together with sub-pattern information and the
total number of strokes.

An example of this is shown in Fig. 3.


Kanji-Jukugo Dictionary

pronunciation     Kanji code string
コウセイ           ① 0704, ② 1617

Kanji Dictionary

Kanji             code       pronunciations    other information
(first Kanji)     ① 0704     コウ, ...          5, A
正                ② 1617     セイ, ...          5

① 0704 is the code for the first Kanji of the Kanji-Jukugo
② 1617 is the code for 正

Fig. 3
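The two-level organization of Fig. 3 can be sketched as a pair of keyed tables. This is only an illustrative layout: the names `KanjiEntry`, `KANJI_DICT`, `JUKUGO_DICT` and `lookup` are hypothetical, the codes 0704 and 1617 are taken from Fig. 3, and the readings and the sub-pattern field format are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KanjiEntry:
    code: str                # Kanji character code, e.g. "1617"
    pronunciations: tuple    # Kana readings of the character (assumed values)
    strokes: int             # total number of strokes
    subpattern: str = ""     # sub-pattern information (format assumed)


# Kanji character dictionary, keyed by character code.
KANJI_DICT = {
    "0704": KanjiEntry("0704", ("コウ",), 5, "A"),
    "1617": KanjiEntry("1617", ("セイ",), 5),
}

# Kanji-Jukugo dictionary: Kana pronunciation -> list of code strings.
JUKUGO_DICT = {
    "コウセイ": [("0704", "1617")],
}


def lookup(pronunciation):
    """Return, for each matching Kanji-Jukugo, the entries of its characters."""
    return [[KANJI_DICT[code] for code in codes]
            for codes in JUKUGO_DICT.get(pronunciation, [])]


for jukugo in lookup("コウセイ"):
    print([(entry.code, entry.strokes) for entry in jukugo])
```

Keeping only codes in the Kanji-Jukugo table means each character's readings are stored once, in the character table, however many Kanji-Jukugos contain it.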




As shown in 2.1.3, Kanji-Jukugos can be generated
by concatenation. The string length of
Kanji-Jukugos can therefore be 2, 3, 4, 5, 6, ..., but the
basic Kanji-Jukugos, from which new Kanji-Jukugos
are generated by concatenation, are no longer
than four Kanji characters. The maximum length of
a Kanji-Jukugo allowed in the dictionary is therefore 4.

2.1.6 Auto Segmentation of Concatenated
Kanji-Jukugo into Basic Kanji-Jukugo When Y1
does not match a Kanji-Jukugo in the dictionary
because Y1 is too long, the auto segmentation program
breaks Y1 down. This algorithm depends on
some linguistic properties of Kanji-Jukugos. The
properties are as follows:

1) The pronunciations of Kanji-Jukugos are "ON"
with few exceptions.

2) The rule for "ON" pronunciations of Kanji characters
is as follows:


Let a, b, e, f, and x be sets of Kana characters
as follows:

a = (ウ, イ, ク, キ, ツ, チ, ン)
b = (ャ, ュ, ョ)
e = (キ, ギ, シ, ジ, チ, ニ, ヒ, ミ, リ)
f = (ウ, ク, ツ, ン)
x = (any other Kana characters)



Then the "ON" pronunciation of almost all
Kanji characters will be one of the following:
x 
xa 
eb 
ebf 

By this property, almost all concatenated Kanji-Jukugos
can be separated into basic Kanji-Jukugos.

After this segmentation, Y1 becomes a sequence of
shorter strings Y1', Y1'', ..., and the dictionary is
referenced again.
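One way to realize this segmentation is sketched below. It is not the program of the Kanji system: the set contents are an assumed reading of a, b, e and f, the greedy longest-pattern choice in `on_units` is a simplification that will missplit some strings, and `on_units`, `segment` and the sample dictionary are hypothetical.

```python
A = set("ウイクキツチン")      # assumed contents of set a
B = set("ャュョ")              # assumed contents of set b
E = set("キギシジチニヒミリ")  # assumed contents of set e
F = set("ウクツン")            # assumed contents of set f


def on_units(kana):
    """Greedily split a Kana string into units of the form ebf, eb, xa or x."""
    units, i, n = [], 0, len(kana)
    while i < n:
        if i + 1 < n and kana[i] in E and kana[i + 1] in B:
            size = 3 if i + 2 < n and kana[i + 2] in F else 2   # ebf or eb
        elif i + 1 < n and kana[i + 1] in A:
            size = 2                                            # xa
        else:
            size = 1                                            # x
        units.append(kana[i:i + size])
        i += size
    return units


def segment(kana, dictionary, max_len=4):
    """Split kana into dictionary entries, cutting only at unit boundaries.

    Each entry spans at most max_len units (one unit per Kanji character).
    Returns a list of entries, or None if no split exists.
    """
    units = on_units(kana)
    best = {len(units): []}              # best[i] = a segmentation of units[i:]
    for i in range(len(units) - 1, -1, -1):
        for j in range(min(len(units), i + max_len), i, -1):
            word = "".join(units[i:j])
            if word in dictionary and j in best:
                best[i] = [word] + best[j]
                break                    # prefer the longest entry at position i
    return best.get(0)


basic = {"カガク", "ギジュツ", "チョウ"}
print(on_units("カガクギジュツチョウ"))
print(segment("カガクギジュツチョウ", basic))
```

Restricting cuts to unit boundaries keeps the number of dictionary probes small, and the limit of four units per entry follows the maximum Kanji-Jukugo length allowed in the dictionary.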

2.2 System Description

The hardware configuration of the Kanji System
is shown in Fig. 4. The computer is a TOSBAC-3400
Model 41 with 64 K words (24 bits/word).
Display information is transferred as Kanji codes.



Fig. 4 (blocks: Kanji Display DDS-150; Computer System TOSBAC-3400 Model 41)


2.2.1 Kanji Display DDS-150 A CRT is used
to display Kanji characters or any line-dot
patterns. The display has two operation modes:
Kanji mode for Kanji display and Raster mode for line-dot
pattern display. 400 characters can be displayed in
Kanji mode, and 128 x 240 points (each point consisting
of 4 dots) can be displayed in Raster mode;
see Figs. 6 and 7. There are four kinds of points
used as components: blank, dot, vertical
line, and horizontal line. 2526 character patterns
are stored in a Miler-Sheet ROM. 64 additional
characters can be stored in the writable control
storage of the DDS-150, so that at one time a total of
2590 characters are stored in memory.

2.2.2 Programming The control programs
of the Kanji System are written in an extended
FORTRAN with capabilities for character and string
manipulation and bit operations. The organization of
the Kanji System programs is shown in Fig. 5.



Fig. 5 

2.2.3 Operation

1) The operation sequence to input a sentence that
begins with a Kanji-Jukugo is as follows:

1. Press the shift key for Kanji-Jukugo.

2. Type the Kana keys for the pronunciation of the
first Kanji-Jukugo. If the operator pushes the "End of
Kanji-Jukugo Information" key, the 16 Kanji-Jukugos
that have this pronunciation will be displayed on the
CRT. If the operator knows and adds information
such as another pronunciation of one of its Kanji
characters, then the desired Kanji-Jukugo uniquely
appears on the CRT.

3. Type the Kana key for the Kana that follows. The
Kana then appears following the Kanji-Jukugo.

4. Press the shift key for Kanji-Jukugo.

5. Type the Kana keys for the next Kanji-Jukugo.

6. Press "End of Kanji-Jukugo Information"; the
Kanji-Jukugo will then appear uniquely.

7. Type the Kana keys for the Kana that follows.

8. Press the shift key for Kanji-Jukugo. (There is no
distinction between Kanji characters and Kanji-Jukugos
for the operator.)

9. Type the pronunciation and a second pronunciation
of the next Kanji character; the character will then
appear.


2) Editing

Error correction and editing capabilities are also
available in the Kanji system. Characters can be
inserted, deleted and changed at any position, both in
sentences being input and in stored files.
The print-out format can also be assigned interactively.
See Fig. 7.



Fig. 6

3. CONCLUSIONS

Section 1.1 described the need and requirements
for an "easy to use" Kanji character input/output
system. The Kanji system comes close to
our definition of an "easy to use" device. An input
speed of 40-60 characters per minute has been achieved.
This speed is not fast enough for some purposes, but it
satisfies the requirements of some man-machine
communication systems. The Kanji system provides
a framework for natural language communication
between people and computers, and the method can
be used for a Kanji character set of any size. No major
modification is required to extend the set; we
only need to add the new Kanji characters to the
dictionary.

Our future experiments will be concerned with a
question-answering system using reasonably natural
Japanese.

BIBLIOGRAPHY 

1. Samuel H. Caldwell, "The Sinotype - A Machine
for the Composition of Chinese from a Keyboard",
Journal of The Franklin Institute, Vol. 267, No. 6,
pp. 471-502, June 1959.

2. Gilbert W. King and Hsien-Wu Chang, "Machine
Translation of Chinese", Scientific American,
Vol. 208, pp. 124-135, June 1963.

3. Gabriel F. Groner, J. F. Heafner, and T. W.
Robinson, "On-line Computer Classification of
Handprinted Chinese Characters as a Translation
Aid", IEEE Transactions on Electronic Computers,
Vol. EC-16, No. 6, December 1967.

4. Hideyuki Hayashi, Sheila Duncan, and Susumu
Kuno, "Graphical Input/Output of Nonstandard
Characters", Communications of the ACM, Vol.
11, No. 9, September 1968.



5.-10. [Japanese-language references]

11. [Japanese-language reference], 1972. 3, pp. 109-115.












